The symbolic manipulation program FORM is specialized to handle very large algebraic
expressions. Some specific features of its internal structure make FORM very well suited for
parallelization.

We now have two parallel versions of FORM: one is based on POSIX threads and is optimal for
modern multicore computers, while the other uses MPI and can be used to parallelize FORM
on clusters and Massively Parallel Processing systems. Most existing FORM programs will be able
to take advantage of parallel execution without the need for modifications.
1. Introduction

The symbolic manipulation system FORM [1], which has been available for more than 20 years,
is specialized to handle very large algebraic expressions of billions of terms in an efficient and
reliable way. It is widely used, in particular in the framework of perturbative Quantum Field
Theory, where sometimes hundreds of thousands of Feynman diagrams have to be computed; most
of the spectacular calculations of refs. [^, ^] would hardly have been possible with other available
systems. However, the abilities of FORM are also quite useful in other fields of science where the
manipulation of huge expressions is necessary.

Parallelization is one of the most efficient ways to increase performance. Some internal
specifics [Q] make FORM very well suited to parallelization, so the idea of parallelizing FORM
is quite natural.

2. General concepts and models in use 

The general concept of FORM parallelization is as follows [Q, ||, ^]: upon startup, the
program launches a master and several workers. FORM treats each expression individually, which
allows the master to split incoming expressions into independent chunks. The chunks are processed
by the workers in parallel, and then the master collects the results.
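The master/worker scheme described above can be pictured with a small sketch. It is only an illustration of the data flow, not FORM's implementation: the term representation (coefficient, power of x), the function names, and the sample operation (multiplying every term by x + 1) are all assumptions made for this example. Python threads are used purely to show the structure; because of the interpreter lock they would not give a real speedup here.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(terms, n_chunks):
    """Master side: deal the terms of one expression into n_chunks lists."""
    return [terms[i::n_chunks] for i in range(n_chunks)]


def process_chunk(chunk):
    """Worker side: multiply each term c*x^p by (x + 1).

    Returns a mapping power -> coefficient for this chunk only.
    """
    result = Counter()
    for coeff, power in chunk:
        result[power + 1] += coeff  # contribution c*x^(p+1)
        result[power] += coeff      # contribution c*x^p
    return result


def run_module(terms, n_workers=4):
    """Master: split, let the workers process the chunks, merge the results."""
    chunks = split_into_chunks(terms, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    merged = Counter()
    for partial in partial_results:
        merged.update(partial)  # Counter.update adds coefficients of like terms
    # Drop cancelled terms and return (coeff, power) pairs sorted by power.
    return sorted(((c, p) for p, c in merged.items() if c != 0),
                  key=lambda term: term[1])


# (1 + 2x + 3x^2) * (x + 1) = 1 + 3x + 5x^2 + 3x^3
expression = [(1, 0), (2, 1), (3, 2)]
product = run_module(expression, n_workers=2)
```

The result does not depend on how many workers are used, because terms are independent and the master merges like terms at the end. That independence is the property the scheme relies on.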

At present, we have two different models [§, ^]: in ParFORM [Q] the master and workers are
independent processes communicating via MPI^1, and in TFORM the master and workers are separate
threads^2 of a multithreaded process.

Both models require almost no special effort for parallel programming: all FORM programs
may be executed in parallel without any changes. The user may give FORM hints on how to
parallelize certain parts better; these hints are simply ignored by the sequential version of FORM.

Since TFORM uses a common address space, it can run only on SMP computers. On the
other hand, it sometimes permits more efficient parallelization, and it does not depend on MPI,
which makes it much easier to deploy. ParFORM can be used not only on SMP computers
but also on clusters and Massively Parallel Processors (MPP).

3. Performance 

Both ParFORM and TFORM demonstrate approximately the same speedup [||, |6|]. Here we
discuss TFORM running the Multiple Zeta Value program on the computer "qftquadS" at DESY.
The computer has 96 GB of main memory and 8 independent CPU cores; the effective number of
CPU cores is 16 due to hyperthreading. The results are given in Fig. 1.

For reference, the run with FORM (the sequential version) took 57078 sec. 

We see three regions: first, the speedup is almost linear up to 8 workers; second, the speedup
is also almost linear in the range of 8-16 workers but with a much smaller slope; and above 16
workers we observe saturation. When we look at the total amount of CPU time used (Fig. 2), we see
that the total CPU time is more or less constant up to 8 workers and above 16 workers. In the range
of 8-16 workers, however, it increases steadily. This is responsible for the slower decline in real
time in the first graph, because the pseudo efficiency (total CPU time divided by real time and
divided by the number of workers) remains more or less the same in this range. This behaviour is
typical of hyperthreading. The total amount of work that can be obtained from this computer is
about 9.5 times the amount that can be obtained from a single core.

^1 Message Passing Interface, see http://www.mpi-forum.org/
^2 TFORM uses POSIX threads, or pthreads.

Figure 1: Running times of the Multiple Zeta Value TFORM program. The runs were for weight 23, up to
depth 7.

Figure 2: Total CPU time of the Multiple Zeta Value TFORM program.

Analysis of the data also reveals that TFORM incurs about 20% overhead for the Multiple
Zeta Value program. This is more than for programs like Mincer. It may be due to the use of
brackets from the master expression, which may involve copious use of locks, although this is
still not completely clear. The result is that for 8 workers the pseudo speedup (total CPU time
divided by real time) is 7.63, while the real speedup (compared to the FORM run) is 6.22. Of
course, this is still very good. The maximum improvement we obtained was a factor of 7.45, for a
run with 17 workers.
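The relation between these figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text (57078 sec for the sequential run, pseudo speedup 7.63 and real speedup 6.22 for 8 workers). The real time and total CPU time for 8 workers are not quoted in the text; they are back-computed here from those ratios. Reading "overhead" as total CPU time relative to the sequential run time is an assumption made for this example, and the function names are illustrative.

```python
def pseudo_speedup(total_cpu_time, real_time):
    """Total CPU time divided by real (wall-clock) time."""
    return total_cpu_time / real_time


def pseudo_efficiency(total_cpu_time, real_time, n_workers):
    """Pseudo speedup divided by the number of workers."""
    return pseudo_speedup(total_cpu_time, real_time) / n_workers


def real_speedup(sequential_time, real_time):
    """Run time of sequential FORM divided by the parallel real time."""
    return sequential_time / real_time


# Figures quoted in the text.
SEQUENTIAL_TIME = 57078.0  # seconds, sequential FORM
PSEUDO_SPEEDUP_8 = 7.63    # 8 workers
REAL_SPEEDUP_8 = 6.22      # 8 workers

# Derived quantities (not quoted in the text).
real_time_8 = SEQUENTIAL_TIME / REAL_SPEEDUP_8
total_cpu_8 = PSEUDO_SPEEDUP_8 * real_time_8

# Assumed definition: extra CPU work relative to the sequential run.
# Algebraically this is just pseudo speedup / real speedup.
overhead_factor_8 = total_cpu_8 / SEQUENTIAL_TIME

efficiency_8 = pseudo_efficiency(total_cpu_8, real_time_8, 8)
```

Under this reading the overhead factor is 7.63 / 6.22, or about 1.23, which is in line with the "about 20%" stated above, and the pseudo efficiency for 8 workers is 7.63 / 8, or about 0.95.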

4. Recent development 

Over the past years parallel FORM versions have picked up a number of new features: 

• Dollar variables. By default, both ParFORM and TFORM switch into sequential mode
for each module that assigns a value to a dollar variable during execution. But there are common
cases in which dollar variables obtained from each term in each chunk can be processed in
parallel in order to get a minimum value, a maximum, or a sum of results. Also, sometimes
the value of the dollar variable at the end of the processing of a term is not important at all.
Hence new module options have been implemented to help FORM process these variables
in parallel: minimum, maximum, sum and local.

• Right-hand side expressions (RHS). Expressions that occur in the right-hand side of a
statement are not a problem for TFORM, since all threads work with the same file system, but
they are a big problem for ParFORM, since the expression may be situated in a scratch file
while different nodes may have independent scratch file systems. For a long time ParFORM forced
evaluation of modules with RHS expressions in sequential mode. Now ParFORM is able to handle
RHS expressions in a truly parallel mode.

• InParallel statement. A new statement, InParallel, was implemented. With this statement
each complete expression is executed by a single worker, and several expressions are processed
simultaneously. This is really useful when there are many short expressions; sometimes it gives
a significant increase in efficiency.
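The dollar-variable options in the first item of the list above amount to a parallel reduction: each worker keeps a private copy of the variable while it processes its chunk, and the master combines the private values at the end. The following sketch illustrates that idea only. It is not FORM code and does not show FORM's internal mechanism; the function names, the `measure` callback and the treatment of the initial value are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# How two values of the variable are combined for each (hypothetical) option.
COMBINE = {
    "minimum": min,
    "maximum": max,
    "sum": lambda a, b: a + b,
}


def worker_pass(chunk, option, start, measure):
    """Worker: update a private copy of the variable for every term in the chunk."""
    value = start
    for term in chunk:
        value = COMBINE[option](value, measure(term))
    return value


def parallel_dollar(terms, option, initial, measure, n_workers=4):
    """Master: run the workers, then fold their private values into one result."""
    # For a sum the workers start from 0 so the initial value is counted once;
    # for minimum/maximum, starting every worker from the initial value is harmless.
    worker_start = 0 if option == "sum" else initial
    chunks = [terms[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(
            pool.map(lambda c: worker_pass(c, option, worker_start, measure), chunks)
        )
    result = initial
    for partial in partials:
        result = COMBINE[option](result, partial)
    return result


# Terms are represented here simply by their power of x.
powers = [3, 7, 1, 9, 4]
highest = parallel_dollar(powers, "maximum", 0, measure=lambda p: p)
lowest = parallel_dollar(powers, "minimum", 100, measure=lambda p: p)
total = parallel_dollar(powers, "sum", 5, measure=lambda p: p)
```

The option local needs no combining step at all in this picture: each worker uses its private copy while processing a term and the value is simply discarded afterwards.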

In Fig. 3 we summarize the speedup curves for TFORM running the MZV program on a computer
with 8 CPU cores when various features are switched off/on. The legend is as follows:

• All par - all above mentioned features are implemented;

• RHSseq - modules with RHS expressions are forced into the sequential mode;

• NoDol - modules with dollar variables are forced into the sequential mode;

• NoInPar - no InParallel statements;

• NoInPar,NoDol - modules with dollar variables are forced into the sequential mode, no
InParallel statements;

• RHSseq,NoInPar - modules with RHS expressions are forced into the sequential mode, no
InParallel statements.

As we can see, all these new features are really important.

Figure 3: Results of the MZV program runs with various features switched off/on. The runs were for
weight 20, up to depth 8.

If FORM programs have to run for a long time, the reliability of the hardware or of the software
infrastructure becomes a critical issue. Program termination due to unforeseen failures may waste
days or weeks of invested execution time. The checkpoint mechanism was introduced to protect
long-running FORM programs as well as possible from such accidental interruptions. With
activated checkpoints, FORM saves its internal state and data to the hard disk from time to time.
These data then allow recovery from a crash. The parallel FORM versions support this mechanism
as well.

By default, data are saved at the end of each module. Usually this is too expensive. Optionally,
the data may be saved only after some time interval has elapsed. The scalability of ParFORM
running BAICER N=16 for different intervals between checkpoints is depicted in Fig. 4. As one can
see, even very frequent checkpoints do not affect performance much.

Figure 4: Absolute time and speedup curves for the test program BAICER without checkpoint mechanism
("NoChck"), checkpoints every 30 minutes ("30 min") and every 10 minutes ("10 min").
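The interval-based policy described above can be sketched as follows. This is a generic illustration, not FORM's checkpoint code or file format: the class name, the use of `pickle`, and the write-then-rename step are assumptions chosen to keep the example short and self-contained.

```python
import os
import pickle
import tempfile
import time


class Checkpointer:
    """Save program state at module boundaries, at most once per interval."""

    def __init__(self, path, interval_seconds=0.0, clock=time.monotonic):
        self.path = path
        self.interval = interval_seconds  # 0 means: save after every module
        self.clock = clock
        self.last_saved = clock()
        self.saves = 0

    def end_of_module(self, state):
        """Call at the end of each module; returns True if a save happened."""
        now = self.clock()
        if now - self.last_saved < self.interval:
            return False  # too soon since the last checkpoint, skip the save
        self._save(state)
        self.last_saved = now
        return True

    def _save(self, state):
        # Write to a temporary file first and rename it afterwards, so a crash
        # during the write cannot corrupt the previous checkpoint.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(state, handle)
        os.replace(tmp_path, self.path)
        self.saves += 1

    def recover(self):
        """Load the most recently saved state after a crash."""
        with open(self.path, "rb") as handle:
            return pickle.load(handle)
```

With `interval_seconds=0` every call to `end_of_module` writes a checkpoint, which corresponds to the default behaviour described above; a larger interval trades a longer potential loss of work for fewer writes.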